Identifying Errors in Russian Web Corpora

Abstract

The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort results, especially when the texts are used for scientific purposes, in language teaching or in language learning. Hence, there is a need to examine existing corpora based on web texts and to clean up the data, which may contain such “noisy” fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among the errors, we name, above all, encoding errors, incorrect font types, as well as segments written in other languages. These phenomena result in errors in morphological analysis and lemmatization, in frequency distortion, and in the fact that lexical units cannot be found and therefore cannot be displayed to corpus users. The paper focuses on these errors, describes their types and outlines possible ways to eliminate them.
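Two of the error types named in the abstract, segments written in other languages and words whose letters come from the wrong character set, can be illustrated with a simple script check. The following is a minimal sketch under stated assumptions, not the method used in the paper: the function names (`scripts_in`, `classify_token`, `flag_suspicious`) are hypothetical, tokens are split on whitespace only, and a token is treated as suspicious if it contains any non-Cyrillic letter.

```python
import unicodedata


def scripts_in(token: str) -> set:
    """Return the scripts of the letters in a token, judged by Unicode character names."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits and punctuation carry no script information here
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
        else:
            scripts.add("Other")
    return scripts


def classify_token(token: str) -> str:
    """Label a token (hypothetical labels, chosen for this illustration only)."""
    scripts = scripts_in(token)
    if not scripts:
        return "no_letters"
    if scripts == {"Cyrillic"}:
        return "cyrillic"
    if len(scripts) > 1:
        return "mixed_script"  # e.g. a Latin look-alike letter inside a Russian word
    return "non_cyrillic"      # candidate for a foreign-language segment


def flag_suspicious(text: str) -> list:
    """Return (token, label) pairs for tokens that are not purely Cyrillic."""
    flagged = []
    for token in text.split():
        label = classify_token(token)
        if label in {"mixed_script", "non_cyrillic"}:
            flagged.append((token, label))
    return flagged


if __name__ == "__main__":
    # The second word starts with Latin "c" (U+0063) instead of Cyrillic "с" (U+0441).
    sample = "это c\u043b\u043e\u0432\u0430\u0440\u044c и hello 2022"
    for token, label in flag_suspicious(sample):
        print(label, repr(token))
```

A mixed-script token looks correct on screen but does not match the dictionary form, which is one way a lexical unit can end up unlemmatized and unfindable in a corpus query. Real Russian web text also contains legitimate Latin-script tokens (brand names, URLs), so a check like this can only propose candidates for review.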

Similar resources

Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the ...

Generating Learner-Like Morphological Errors in Russian

To speed up the process of categorizing learner errors and obtaining data for languages which lack error-annotated data, we describe a linguistically-informed method for generating learner-like morphological errors, focusing on Russian. We outline a procedure to select likely errors, relying on guiding stem and suffix combinations from a segmented lexicon to match particular error categories an...

Annotation errors detection in TTS corpora

We investigate the problem of automatic detection of annotation errors in single-speaker read-speech corpora used for text-to-speech (TTS) synthesis. Various word-level feature sets were used, and the performance of several detection methods based on support vector machines, extremely randomized trees, k-nearest neighbors, and the performance of novelty and outlier detection are evaluated. We sho...

Detecting Annotation Errors in Spoken Language Corpora

Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other...

Identifying Comparable Corpora Using LDA

Parallel corpora have applications in many areas of Natural Language Processing, but are very expensive to produce. Much information can be gained from comparable texts, and we present an algorithm which, given any bodies of text in multiple languages, uses existing named entity recognition software and topic detection algorithm to generate pairs of comparable texts without requiring a parallel...

Journal

Journal title: Jazykovedný časopis

Year: 2022

ISSN: 0021-5597, 1338-4287

DOI: https://doi.org/10.2478/jazcas-2022-0021